Goto

Collaborating Authors

 training step


BASIS: Batchwise Advantage Estimation from Single-Rollout Information Sharing for LLM Reasoning

arXiv.org Machine Learning

Reinforcement learning with verifiable rewards has become a standard recipe for improving the reasoning abilities of large language models. Existing algorithms face a tradeoff between computational efficiency and sample efficiency in value estimation and policy learning. We introduce BASIS, a critic-free post-training algorithm designed to address this tradeoff. At each online training step, BASIS samples only one rollout per prompt, but leverages rich information across prompts in the entire batch to improve value function estimation. Our experiments demonstrate that BASIS reduces MSE in value function estimation by 69% compared to REINFORCE++, a representative single-rollout baseline, and achieves lower MSE with one rollout than group mean estimators with 8 rollouts. This improvement in value estimation translates to better policy optimization: using substantially less training time, BASIS achieves performance close to multi-rollout GRPO-type baselines and often outperforms single-rollout REINFORCE-type baselines.


Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning

arXiv.org Machine Learning

Recent advances in large language models (LLMs) have increasingly relied on reinforcement learning (RL) to improve their reasoning capabilities. Three types of approaches have been widely adopted: The first relies on a deep neural network to estimate the value function of the learning policy in order to reduce the variance of the policy gradient. However, estimating and maintaining such a value network incurs substantial computational and memory overhead. The second avoids training a value network by approximating the value function using sample averages. However, it samples a large number of reasoning traces per prompt for accurate value function approximation, making it computationally expensive. The third samples only a single reasoning trajectory per prompt, which reduces computational cost but suffers from poor sample efficiency. This paper focuses on a practical, resource-constrained setting in which only a small number of reasoning traces can be sampled per prompt, while low-variance gradient estimation remains essential for high-quality policy learning. To address this challenge, we bring classical nonparametric statistical methods, which are both computationally and statistically efficient, to LLM reasoning. We employ kernel smoothing as a concrete example for value function estimation and the subsequent policy optimization. Numerical and theoretical results demonstrate that our proposal achieves accurate value and gradient estimation, leading to improved policy optimization.



Appendix ANetwork Architectures

Neural Information Processing Systems

In this section, we describe the details of the network architectures used in Sec. 4 and 5. We mainly used 4 GPUs (NVIDIAV100; 16GB) for the experiments in Sec. 4 and 5 and it took about 4 hours per seed (in the case of 3M steps). Actually, we conducted exhaustive evaluations through the enormous experiments, and we hope our empirical observations and recommendations help the practitioners to explore the explosive configuration space. Adam Adam Learning rate (policy) 1e-4 5e-5 3e-4 3e-4 Learning rate (value) 1e-4 1e-2 3e-4 3e-4 Weight initialization Uniform Xavier Uniform Xavier Uniform Xavier Uniform Initial output scale (policy) 1.0 1e-4 1e-2 1e-2 Target update Hard - Soft (5e-3) Soft (5e-3) Clipped Double QFalse - True True Table 7: Details of each network architecture. We refer the original implementations of each algorithm which is available online [23, 14, 48, 27, 42].





Two-Time-Scale Learning Dynamics: A Population View of Neural Network Training

arXiv.org Machine Learning

Population-based learning paradigms, including evolutionary strategies, Population-Based Training (PBT), and recent model-merging methods, combine fast within-model optimisation with slower population-level adaptation. Despite their empirical success, a general mathematical description of the resulting collective training dynamics remains incomplete. We introduce a theoretical framework for neural network training based on two-time-scale population dynamics. We model a population of neural networks as an interacting agent system in which network parameters evolve through fast noisy gradient updates of SGD/Langevin type, while hyperparameters evolve through slower selection--mutation dynamics. We prove the large-population limit for the joint distribution of parameters and hyperparameters and, under strong time-scale separation, derive a selection--mutation equation for the hyperparameter density. For each fixed hyperparameter, the fast parameter dynamics relaxes to a Boltzmann--Gibbs measure, inducing an effective fitness for the slow evolution. The averaged dynamics connects population-based learning with bilevel optimisation and classical replicator--mutator models, yields conditions under which the population mean moves toward the fittest hyperparameter, and clarifies the role of noise and diversity in balancing optimisation and exploration. Numerical experiments illustrate both the large-population regime and the reduced two-time-scale dynamics, and indicate that access to the effective fitness, either in closed form or through population-level estimation, can improve population-level updates.